Preparations

Load the necessary libraries

library(car)       #for regression diagnostics
library(broom)     #for tidy output
library(ggfortify) #for model diagnostics
library(DHARMa)    #for residual diagnostics
library(performance) #for residual diagnostics
library(see)         #for plotting residuals
library(sjPlot)    #for outputs
library(knitr)     #for kable
library(effects)   #for partial effects plots
library(ggeffects) #for partial effects plots
library(emmeans)   #for estimating marginal means
library(modelr)    #for auxiliary modelling functions
library(tidyverse) #for data wrangling
theme_set(theme_classic())

Scenario

Here is an example from Fowler, Cohen, and Jarvis (1998). An agriculturalist was interested in the effects of fertilizer load on the yield of grass. Grass seed was sown uniformly over an area and different quantities of commercial fertilizer were applied to each of ten randomly located 1 m2 plots. Two months later the grass from each plot was harvested, dried and weighed. The data are in the file fertilizer.csv in the data folder.

fert yield
25 84
50 80
75 90
100 154
125 148
... ...
fert: Mass of fertilizer (g.m-2) - Predictor variable
yield: Yield of grass (g.m-2) - Response variable

The aim of the analysis is to investigate the relationship between fertilizer concentration and grass yield.

Read in the data

fert = read_csv('../data/fertilizer.csv', trim_ws=TRUE)
## Parsed with column specification:
## cols(
##   FERTILIZER = col_double(),
##   YIELD = col_double()
## )
fert <- rename(fert, yield = YIELD, fert = FERTILIZER)
glimpse(fert)
## Rows: 10
## Columns: 2
## $ fert  <dbl> 25, 50, 75, 100, 125, 150, 175, 200, 225, 250
## $ yield <dbl> 84, 80, 90, 154, 148, 169, 206, 244, 212, 248
## Explore the first 6 rows of the data
head(fert)
str(fert)
## tibble [10 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ fert : num [1:10] 25 50 75 100 125 150 175 200 225 250
##  $ yield: num [1:10] 84 80 90 154 148 169 206 244 212 248
##  - attr(*, "spec")=
##   .. cols(
##   ..   FERTILIZER = col_double(),
##   ..   YIELD = col_double()
##   .. )

Exploratory data analysis

ggplot(fert, aes(x = fert, y = yield)) +
  geom_smooth(method = 'lm') +
  geom_smooth(se = FALSE, col = "red") +
  geom_point()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Model formula: \[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_1 x_i \]

where \(y_i\) represents the \(i\)th observed value, \(\beta_0\) and \(\beta_1\) represent the intercept and slope respectively, and \(\sigma^2\) represents the estimated variance.

ggplot(fert, aes(y=yield, x=fert)) +
  geom_point() +
  geom_smooth()

ggplot(fert, aes(y=yield, x=fert)) +
  geom_point() +
  geom_smooth(method='lm')

ggplot(fert, aes(y=yield)) +
  geom_boxplot(aes(x=1))

Fit the model

fert.lm <- lm(yield ~ 1 + fert, data = fert) # intercept written explicitly
fert.lm <- lm(yield ~ fert, data = fert)     # equivalent: the intercept is implicit
attributes(fert.lm)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"
names(fert.lm)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
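
The components listed above can be pulled out with the standard accessor functions rather than by indexing the list directly; a brief sketch (using the fert.lm model fitted above):

```r
## Extract the estimated coefficients (intercept and slope)
coef(fert.lm)
## Extract the fitted (predicted) values for the observed data
fitted(fert.lm)
## Extract the residuals (observed - fitted)
resid(fert.lm)
```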

Model validation

Check if our assumptions really are adequate!

autoplot(fert.lm, which = 1:6, ncol = 2, label.size = 3)

  1. Residual plot: model residuals vs. fitted (expected) values
  • No patterns in the points; random scattering
  • Watch out for a wedge shape, which indicates a relationship between the mean and the variance
  2. Quantile-quantile (QQ) plot: compares your observed values to what they should look like if drawn from a Gaussian distribution (or whichever distribution was selected); the points should follow the 1:1 line
  • Often the tails at either end dip over or under the line, indicating the distribution is not a great fit
  3. Scale-location plot
  • If we were correcting for autocorrelation, this one should appear random while the residual plot would not
  • Otherwise, it looks much the same as the residual plot
  4. Cook’s distance: measures how influential each point is, based on its residual value and its leverage
  • Values over 0.8 or 1 are considered ‘too’ influential
  • Look for outliers along the x- and/or y-axis
  5. Residuals vs. leverage
  • Not so useful on its own, but shows whether a large residual also has high leverage
  6. Cook’s distance vs. leverage
  • Indicates whether an influential point is driven by high leverage (on its own) or not, based on the red boundary
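
The DHARMa and performance packages loaded in the Preparations offer alternative residual diagnostics; a minimal sketch (assuming the fert.lm model fitted above):

```r
## Simulated (quantile) residuals: QQ and residual-vs-predicted plots
fert.resid <- simulateResiduals(fert.lm, plot = TRUE)

## A battery of assumption checks (plotted via the see package)
check_model(fert.lm)
```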

Model outputs
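
This section is left as a stub; one way to fill it, using the broom and sjPlot packages loaded above (a sketch, not the original content):

```r
## Traditional summary table: coefficients, standard errors, t-tests, R-squared
summary(fert.lm)

## Tidy data-frame versions of the same information
tidy(fert.lm, conf.int = TRUE)  # per-parameter estimates
glance(fert.lm)                 # whole-model statistics

## Publication-style HTML table
tab_model(fert.lm)
```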

Model investigation / hypothesis testing
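
A sketch of how this stub might proceed: for a simple linear model, the t-test on the slope, its confidence interval, and the ANOVA F-test all address the null hypothesis that \(\beta_1 = 0\).

```r
## t-test for each parameter (H0: parameter = 0)
summary(fert.lm)$coefficients
## 95% confidence intervals for the parameters
confint(fert.lm)
## ANOVA table - for a single predictor, the F-test is equivalent to the t-test
anova(fert.lm)
```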

Predictions
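
A sketch for this stub, predicting yield at new fertilizer concentrations (the values 110 and 210 g.m-2 are arbitrary illustrations):

```r
## Predictions with confidence intervals via base R
newdata <- data.frame(fert = c(110, 210))
predict(fert.lm, newdata = newdata, interval = "confidence")

## The same estimates via emmeans, using 'at' to set the predictor values
emmeans(fert.lm, ~fert, at = list(fert = c(110, 210)))
```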

Additional analyses
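
A sketch for this stub, using the performance and car packages loaded above:

```r
## Proportion of variance in yield explained by fertilizer
r2(fert.lm)
summary(fert.lm)$r.squared

## Check for autocorrelation among residuals (car)
durbinWatsonTest(fert.lm)
```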

Summary figures
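
A sketch for this stub: a summary figure overlaying the fitted trend and its confidence band on the raw data, built with the modelr helpers loaded above.

```r
## A fine grid of fertilizer values spanning the observed range
fert.grid <- fert %>% data_grid(fert = seq_range(fert, 100))

## Attach predictions and confidence limits
fert.pred <- bind_cols(fert.grid,
                       as.data.frame(predict(fert.lm, newdata = fert.grid,
                                             interval = "confidence")))

ggplot(fert.pred, aes(x = fert, y = fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "grey80") +
  geom_line() +
  geom_point(data = fert, aes(y = yield)) +
  labs(x = "Fertilizer (g.m-2)", y = "Yield (g.m-2)")
```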

References

Fowler, J., L. Cohen, and P. Jarvis. 1998. Practical Statistics for Field Biology. England: John Wiley & Sons.